{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "49c3e85c",
   "metadata": {},
   "source": [
    "| [01_word_embedding/01_文本表示.ipynb](https://github.com/shibing624/nlp-tutorial/blob/main/01_word_embedding/01_%E6%96%87%E6%9C%AC%E8%A1%A8%E7%A4%BA.ipynb)  | 文本向量表示  |[Open In Colab](https://colab.research.google.com/github/shibing624/nlp-tutorial/blob/main/01_word_embedding/01_%E6%96%87%E6%9C%AC%E8%A1%A8%E7%A4%BA.ipynb) |\n",
    "\n",
    "\n",
    "# 文本表示\n",
    "\n",
    "文本表示：将文本数据表示成计算机能够运算的数字或向量。\n",
    "\n",
    "在自然语言处理（Natural Language Processing，NLP）领域，文本表示是处理流程的第一步，主要是将文本转换为计算机可以运算的数字。\n",
    "\n",
    "这里多提一下，并不是所有人工智能（AI）模型都要做表示转换的，如计算机视觉（Compute Vision，CV）的图像识别，因为图片存储本身就是数字化的，所以直接用像素值处理就可以，这也是做CV的同学转NLP要注意的一点。\n",
    "\n",
    "文本表示有两大类方法：\n",
    "1. 离散表示\n",
    "2. 分布表示"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1474c849",
   "metadata": {},
   "source": [
    "# 离散表示\n",
    "\n",
    "## 独热编码（One-hot）\n",
    "\n",
    "- 思想：\n",
    "\n",
    "  将语料库中所有的词拉成一个向量，给每个词一个下标，就得到对应的词典。每个分词的文本表示为该分词的比特位为1,其余位为0的矩阵表示。\n",
    "\n",
    "## 词袋模型(Bag of Words)\n",
    "\n",
    "- 思想：\n",
    "\n",
    "  把每篇文章看成一袋子词，并忽略每个词出现的顺序。具体来看：将整段文本表示成一个长向量，每一维代表一个单词。该维对应的权重代表这个词在原文章中的重要程度。\n",
    "\n",
    "- 例子1：\n",
    "\n",
    "  句1：Jane wants to go to Shenzhen 句2：Bob wants to go to Shanghai\n",
    "\n",
    "  使用两个例句来构造词袋： [Jane, wants, to, go, Shenzhen, Bob, Shanghai]\n",
    "\n",
    "  两个例句就可以用以下两个向量表示，对应的下标与映射数组的下标相匹配，其值为该词语出现的次数\n",
    "\n",
    "  句1：[1,1,2,1,1,0,0]  句2：[0,1,2,1,0,1,1]\n",
    "\n",
    "- 例子2：\n",
    "\n",
    "  这次我们加上停用词和标点符号的处理，\n",
    "\n",
    "  句1：Jane wants to go to Shenzhen . 句2：Bob wants to go to Shanghai , me too ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "3537d7b8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "['Jane', 'wants', 'to', 'go', 'to', 'Shenzhen', '.']"
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sentence1 = 'Jane wants to go to Shenzhen .'\n",
    "sentence2 = 'Bob wants to go to Shanghai , me too .'\n",
    "\n",
    "tokens1 = sentence1.split(\" \")\n",
    "tokens2 = sentence2.split(\" \")\n",
    "tokens1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f98cb9c3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "['Jane',\n 'wants',\n 'to',\n 'go',\n 'Shenzhen',\n '.',\n 'Bob',\n 'Shanghai',\n ',',\n 'me',\n 'too']"
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def vectorize(tokens, filtered_vocab):\n",
    "    \"\"\"\n",
    "    向量化\n",
    "    \"\"\"\n",
    "    vector = []\n",
    "    for w in filtered_vocab:\n",
    "        vector.append(tokens.count(w))\n",
    "    return vector\n",
    "\n",
    "def unique(sequence):\n",
    "    \"\"\"\n",
    "    去重\n",
    "    \"\"\"\n",
    "    seen = set()\n",
    "    return [x for x in sequence if not (x in seen or seen.add(x))]\n",
    "\n",
    "# 停用词\n",
    "stopwords = [\"to\", \"is\", \"a\"]\n",
    "# 标点符号\n",
    "special_chars = [\",\", \":\", \";\", \".\", \"?\"]\n",
    "\n",
    "# create a vocabulary list\n",
    "vocab = unique(tokens1+tokens2)\n",
    "vocab"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8651895",
   "metadata": {},
   "source": [
    "使用两个例句的tokens，过滤停用词和标点符号后来构造有效词袋："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a8f6046a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "['Jane', 'wants', 'go', 'Shenzhen', 'Bob', 'Shanghai', 'me', 'too']"
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 过滤停用词和标点符号\n",
    "filtered_vocab = []\n",
    "for w in vocab: \n",
    "    if w not in stopwords and w not in special_chars: \n",
    "        filtered_vocab.append(w)\n",
    "filtered_vocab"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad829a30",
   "metadata": {},
   "source": [
    "两个例句的向量表示："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "04ddcc82",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[1, 1, 1, 1, 0, 0, 0, 0]\n",
      "[0, 1, 1, 0, 1, 1, 1, 1]\n"
     ]
    }
   ],
   "source": [
    "# convert sentences into vectords\n",
    "vector1 = vectorize(tokens1, filtered_vocab)\n",
    "print(vector1)\n",
    "vector2 = vectorize(tokens2, filtered_vocab)\n",
    "print(vector2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1eb079bd",
   "metadata": {},
   "source": [
    "Bag of Words模型向量的size就是vocabulary的size大小，所以该向量表示非常稀疏。\n",
    "\n",
    "\n",
    "下面演示使用sklearn库做Bag of Words模型："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "fc79fa5c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "   bob  go  jane  me  shanghai  shenzhen  to  too  wants\n0    0   1     1   0         0         1   2    0      1\n1    1   1     0   1         1         0   2    1      1",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>bob</th>\n      <th>go</th>\n      <th>jane</th>\n      <th>me</th>\n      <th>shanghai</th>\n      <th>shenzhen</th>\n      <th>to</th>\n      <th>too</th>\n      <th>wants</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>0</td>\n      <td>1</td>\n      <td>1</td>\n      <td>0</td>\n      <td>0</td>\n      <td>1</td>\n      <td>2</td>\n      <td>0</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>1</td>\n      <td>1</td>\n      <td>0</td>\n      <td>1</td>\n      <td>1</td>\n      <td>0</td>\n      <td>2</td>\n      <td>1</td>\n      <td>1</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n",
    " \n",
    "sentence1 = 'Jane wants to go to Shenzhen .'\n",
    "sentence2 = 'Bob wants to go to Shanghai , me too .'\n",
    "  \n",
    "count_vec = CountVectorizer(ngram_range=(1, 1), # to use bigrams ngram_range=(2,2)\n",
    "                           #stop_words='english'\n",
    "                           )\n",
    "#transform\n",
    "feature = count_vec.fit_transform([sentence1, sentence2])\n",
    " \n",
    "#create dataframe\n",
    "df = pd.DataFrame(feature.toarray(), columns=count_vec.get_feature_names())\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1afc1c73",
   "metadata": {},
   "source": [
    "\n",
    "## 词频-逆向文件频率（TF-IDF）\n",
    "\n",
    "- 思想：\n",
    "\n",
    "  字词的重要性随着它在文件中出现的次数成正比增加，但同时会随着它在语料库中出现的频率成反比下降。如果某个单词在一篇文章中出现的频率TF高，并且在其他文章中很少出现，则认为此词或者短语具有很好的类别区分能力，适合用来分类。\n",
    "\n",
    "- 公式：\n",
    "\n",
    "  - $TF-IDF(t,d)=TF(t,d) × IDF(t)$\n",
    "  - $IDF(t)=log\\frac {文章总数} {包含单词t的文章总数+1}$\n",
    "  - $TF=\\frac{单词t在文档中出现的次数}{该文档的总词量}$\n",
    "\n",
    "- 缺点：\n",
    "\n",
    "  （1）没有考虑特征词的位置因素对文本的区分度，词条出现在文档的不同位置时，对区分度的贡献大小是不一样的。\n",
    "\n",
    "  （2）按照传统TF-IDF，往往一些生僻词的IDF(反文档频率)会比较高、因此这些生僻词常会被误认为是文档关键词。\n",
    "\n",
    "  （3）IDF部分只考虑了特征词与它出现的文本数之间的关系，而忽略了特征项在一个类别中不同的类别间的分布情况。\n",
    "\n",
    "  （4）对于文档中出现次数较少的重要人名、地名信息提取效果不佳。\n",
    "\n",
    "使用示例："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "128da3b6",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Building prefix dict from the default dictionary ...\n",
      "Dumping model to file cache /var/folders/my/wtb78vc53sv_jjmg5pk2j1_c0000gn/T/jieba.cache\n",
      "Loading model cost 0.797 seconds.\n",
      "Prefix dict has been built successfully.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0.         0.         0.         0.         0.         0.\n",
      "  0.         0.57735027 0.57735027 0.         0.57735027 0.        ]\n",
      " [0.33333333 0.33333333 0.33333333 0.33333333 0.33333333 0.33333333\n",
      "  0.33333333 0.         0.         0.33333333 0.         0.33333333]]\n"
     ]
    },
    {
     "data": {
      "text/plain": "         上海         也         去        张三         很         我         是  \\\n0  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   \n1  0.333333  0.333333  0.333333  0.333333  0.333333  0.333333  0.333333   \n\n        李四       深圳         爱       爱去         ，  \n0  0.57735  0.57735  0.000000  0.57735  0.000000  \n1  0.00000  0.00000  0.333333  0.00000  0.333333  ",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>上海</th>\n      <th>也</th>\n      <th>去</th>\n      <th>张三</th>\n      <th>很</th>\n      <th>我</th>\n      <th>是</th>\n      <th>李四</th>\n      <th>深圳</th>\n      <th>爱</th>\n      <th>爱去</th>\n      <th>，</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>0.000000</td>\n      <td>0.000000</td>\n      <td>0.000000</td>\n      <td>0.000000</td>\n      <td>0.000000</td>\n      <td>0.000000</td>\n      <td>0.000000</td>\n      <td>0.57735</td>\n      <td>0.57735</td>\n      <td>0.000000</td>\n      <td>0.57735</td>\n      <td>0.000000</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>0.333333</td>\n      <td>0.333333</td>\n      <td>0.333333</td>\n      <td>0.333333</td>\n      <td>0.333333</td>\n      <td>0.333333</td>\n      <td>0.333333</td>\n      <td>0.00000</td>\n      <td>0.00000</td>\n      <td>0.333333</td>\n      <td>0.00000</td>\n      <td>0.333333</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import jieba\n",
    "sentence1 = '李四爱去深圳'\n",
    "sentence2 = '张三很爱去上海，我也是'\n",
    "contents = [sentence1, sentence2]\n",
    "# 参数为 CounterVectorizer 和 TfidfTransformer 的所有参数\n",
    "vec = TfidfVectorizer(tokenizer=jieba.lcut,\n",
    "                      stop_words=stopwords,\n",
    "                      norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)\n",
    "feature = vec.fit_transform(contents) #直接对文档进行转换提取tfidf特征\n",
    "#一步就得到了tfidf向量\n",
    "print(feature.toarray())\n",
    "#create dataframe\n",
    "df = pd.DataFrame(feature.toarray(), columns=vec.get_feature_names())\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2edd776c",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## N-Gram模型（统计语言模型）\n",
    "\n",
    "- 统计语言模型：\n",
    "\n",
    "  是一个基于概率的判别模型。统计语言模型把语言（词的序列）看作一个随机事件，并赋予相应的概率来描述其属于某种语言集合的可能性。\n",
    "    给定一个词汇集合 V，对于一个由 V 中的词构成的序列S = ⟨w1, · · · , wT ⟩ ∈ Vn，统计语言模型赋予这个序列一个概率P(S)，\n",
    "    来衡量S 符合自然语言的语法和语义规则的置信度。用一句简单的话说，统计语言模型就是计算一个句子的概率大小的这种模型。\n",
    "\n",
    "- 思想：\n",
    "\n",
    "  N-Gram是一种基于统计语言模型的算法。它的基本思想是将文本里面的内容按照字节进行大小为N的滑动窗口操作，形成了长度是N的字节片段序列。\n",
    "    每一个字节片段称为gram，对所有gram的出现频度进行统计，并且按照事先设定好的阈值进行过滤，形成关键gram列表，也就是这个文本的向\n",
    "    量特征空间，列表中的每一种gram就是一个特征向量维度。把这些生成一个字典，按照词袋模型的方式进行编码得到结果。\n",
    "\n",
    "- 例子：\n",
    " "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "0457875c",
   "metadata": {},
   "outputs": [],
   "source": [
    "sentence1 = 'John likes to watch movies. Mary likes too'\n",
    "sentence2 = 'John also likes to watch football games.'\n",
    "tokens1 = sentence1.split()\n",
    "tokens2 = sentence2.split()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "1f957242",
   "metadata": {},
   "outputs": [],
   "source": [
    "# 构建ngram\n",
    "def create_ngram_set(input_list, ngram_value=2):\n",
    "    \"\"\"\n",
    "    Create a set of n-grams\n",
    "    :param input_list: [1, 2, 3, 4, 9]\n",
    "    :param ngram_value: 2\n",
    "    :return: {(1, 2),(2, 3),(3, 4),(4, 9)}\n",
    "    \"\"\"\n",
    "    return set(zip(*[input_list[i:] for i in range(ngram_value)]))\n",
    "\n",
    "\n",
    "def add_ngram(sequences, token_indice, ngram_range=2):\n",
    "    \"\"\"\n",
    "    Augment the input list by appending n-grams values\n",
    "    :param sequences:\n",
    "    :param token_indice:\n",
    "    :param ngram_range:\n",
    "    :return:\n",
    "    \"\"\"\n",
    "    new_seq = []\n",
    "    for input in sequences:\n",
    "        new_list = input[:]\n",
    "        for i in range(len(new_list) - ngram_range + 1):\n",
    "            for ngram_value in range(2, ngram_range + 1):\n",
    "                ngram = tuple(new_list[i:i + ngram_value])\n",
    "                if ngram in token_indice:\n",
    "                    new_list.append(token_indice[ngram])\n",
    "        new_seq.append(new_list)\n",
    "    return new_seq"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "c3b39147",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": "{('John', 'also'),\n ('John', 'likes'),\n ('Mary', 'likes'),\n ('also', 'likes'),\n ('football', 'games.'),\n ('likes', 'to'),\n ('likes', 'too'),\n ('movies.', 'Mary'),\n ('to', 'watch'),\n ('too', 'John'),\n ('watch', 'football'),\n ('watch', 'movies.')}"
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "create_ngram_set(tokens1 + tokens2, ngram_value=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c72243b",
   "metadata": {},
   "source": [
    "构造字典：\n",
    "\n",
    "```\n",
    "  {\"John likes”: 1, \"likes to”: 2, \"to watch”: 3, \"watch movies”: 4, \"Mary likes”: 5, \"likes too”: 6, \"John also”: 7, \"also likes”: 8, “watch football”: 9, \"football games\": 10}\n",
    "```\n",
    "此时，第一句的向量表示为：**[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]**，其中第一个1表示**John likes**在该句中出现了1次，依次类推。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "83d19364",
   "metadata": {},
   "source": [
    "## 离散表示的缺点\n",
    "\n",
    "- 无法衡量词向量之间的关系。\n",
    "- 词表的维度随着语料库的增长而膨胀。\n",
    "- n-gram词序列随语料库增长呈指数型膨胀，更加快。\n",
    "- 离散数据来表示文本会带来数据稀疏问题，导致丢失了信息，与我们生活中理解的信息是不一样的。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d7cf5a65",
   "metadata": {},
   "source": [
    "# 分布式表示\n",
    "\n",
    "主要思想是**用周围的词表示该词**\n",
    "\n",
    "## 共现矩阵(Cocurrence matrix)\n",
    "\n",
    "- 思想：\n",
    "\n",
    "  “共现”，即共同出现，如一句话中共同出现，或一篇文章中共同出现。这里给共同出现的距离一个规范——窗口，如果窗口宽度是2，那就是在当前词的前后各2个词的范围内共同出现。可以想象，其实是一个总长为5的窗口依次扫过所有文本，同时出现在其中的词就说它们共现。\n",
    "\n",
    "- 例子：\n",
    "\n",
    "  ![image-20210117155944299](http://image.wonkers.cn//image-20210117155944299.png)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "236f493c",
   "metadata": {},
   "source": [
    "\n",
    "## 神经网络语言模型（NNLM）\n",
    "\n",
    "- 思想：\n",
    "\n",
    "  - NNLM是从语言模型出发(即计算概率角度)，构建神经网络针对目标函数对模型进行最优化，训练的起点是使用神经网络去搭建语言模型实现词的预测任务，并且在优化过程后模型的副产品就是词向量。\n",
    "\n",
    "  - 进行神经网络模型的训练时，目标是进行词的概率预测，就是在词环境下，预测下一个该是什么词，目标函数如下式, 通过对网络训练一定程度后，最后的模型参数就可当成词向量使用。\n",
    "  - 最后关心的并不是输出层的预测概率，而是通过BP+SGD得到的中间产物：最优投影矩阵C，将其作为文本表示矩阵。\n",
    "\n",
    "- 概率函数：$f(w_{t},w_{t-1},...,w_{t-n+2}, w_{t-n+1})=p(w_{t} | {w_{1}}^{t-1})$\n",
    "\n",
    "- 目标函数：![img](http://image.wonkers.cn//5012681-dfd2deb8955da0cb.png)\n",
    "\n",
    "  - 约束条件：![image-20210118145038303](http://image.wonkers.cn//image-20210118145038303.png)\n",
    "\n",
    "- 训练过程就是学习θ的最大似然, 其中R(θ) 是正则项。\n",
    "\n",
    "- 模型结构：\n",
    "\n",
    "<img src=\"http://image.wonkers.cn//945696-20170901170825312-1330533346.png\" alt=\"img\" style=\"zoom:80%;\" />\n",
    "\n",
    "- 模型分为两部分：特征映射和计算条件概率分布\n",
    "\n",
    "  - 特征映射：对应结构图中最底部的紫色虚线Matrix C\n",
    "\n",
    "    - 目的是进行特征降维，结果是将字典V中的单词onehot特征表示投影转换到稠密词向量表示，作为NNLM的输入。\n",
    "\n",
    "    <img src=\"http://image.wonkers.cn//2020-2-19_17-18-49.png\" alt=\"img\" style=\"zoom:80%;\" />\n",
    "\n",
    "  - 计算条件概率分布：经过神经网络的输入层隐藏层后，经softmax做归一化计算概率得到输出层\n",
    "\n",
    "    ![image-20210118145156054](http://image.wonkers.cn//image-20210118145156054.png)\n",
    "\n",
    "- 神经网络结构\n",
    "\n",
    "![img](http://image.wonkers.cn//v2-35870dc7d2e97e2844f9c3aad72a5fb0_720w.jpg)\n",
    "\n",
    "![image-20210118145332072](http://image.wonkers.cn//image-20210118145332072.png)\n",
    "\n",
    "直连矩阵W可以加快模型训练速度，但对效果提升不大。直连可以合并词向量不经过隐含层，直接右乘直连矩阵 W 得到 $v \\times 1$ 维输出后与前述的 $v \\times 1$ 维输出向相加，得到一个最终的 $v \\times 1$ 维输出向量。\n",
    "\n",
    "- 模型训练\n",
    "\n",
    "![image-20210118145411203](http://image.wonkers.cn//image-20210118145411203.png)\n",
    "\n",
    "- 流程梳理\n",
    "\n",
    "![image-20210118150555438](http://image.wonkers.cn//image-20210118150555438.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17e929c9",
   "metadata": {},
   "source": [
    "## Word2Vec\n",
    "\n",
    "- CBOW\n",
    "\n",
    "  - 获得中间词两边的的上下文，然后用周围的词去预测中间的词，把中间词当做y，把窗口中的其它词当做x输入，x输入是经过one-hot编码过的，然后通过一个隐层进行求和操作，最后通过激活函数softmax，可以计算出每个单词的生成概率，接下来的任务就是训练神经网络的权重，使得语料库中所有单词的整体生成概率最大化，而求得的权重矩阵就是文本表示词向量的结果。\n",
    "  - 与NNLM的联系：\n",
    "    - 移除前向反馈神经网络中非线性的hidden layer，直接将中间层的Embedding layer与输出层的softmax layer连接；\n",
    "    - 忽略上下文环境的序列信息：输入的所有词向量均汇总到同一个Embedding layer；\n",
    "    - 将Future words纳入上下文环境\n",
    "  - 模型结构\n",
    "\n",
    "  <img src=\"http://image.wonkers.cn//2020-2-19_17-29-38.png\" alt=\"img\" style=\"zoom:50%;\" />\n",
    "\n",
    "  - 流程梳理\n",
    "\n",
    "  ![image-20210118164239883](http://image.wonkers.cn//image-20210118164239883.png)\n",
    "\n",
    "- Skip-Gram\n",
    "  - 通过当前词来预测窗口中上下文词出现的概率模型，把当前词当做x，把窗口中其它词当做y，依然是通过一个隐层接一个Softmax激活函数来预测其它词的概率。\n",
    "  - Skip-gram模型的本质是**计算输入word的input vector与目标word的output vector之间的余弦相似度，并进行softmax归一化**。我们要学习的模型参数正是这两类词向量。\n",
    "\n",
    "- 优化tricks\n",
    "\n",
    "  - 层次Softmax\n",
    "\n",
    "    - 本质是把 N 分类问题变成 log(N)次二分类\n",
    "    - hierarchical softmax 使用一颗二叉树表示词汇表中的单词，每个单词都作为二叉树的叶子节点。对于一个大小为V的词汇表，其对应的二叉树包含V-1非叶子节点。假如每个非叶子节点向左转标记为1，向右转标记为0，那么每个单词都具有唯一的从根节点到达该叶子节点的由｛0 1｝组成的代号（实际上为哈夫曼编码，为哈夫曼树，是带权路径长度最短的树，哈夫曼树保证了词频高的单词的路径短，词频相对低的单词的路径长，这种编码方式很大程度减少了计算量）。\n",
    "    - 使用Huffman Tree来编码输出层的词典，相当于平铺到各个叶子节点上，**瞬间把维度降低到了树的深度**，可以看如下图所示。这课Tree把出现频率高的词放到靠近根节点的叶子节点处，每一次只要做二分类计算，计算路径上所有非叶子节点词向量的贡献即可。\n",
    "    - <img src=\"http://image.wonkers.cn//2020-2-19_17-30-36.png\" alt=\"img\" style=\"zoom:80%;\" />\n",
    "    - ![image-20210118165233683](http://image.wonkers.cn//image-20210118165233683.png)\n",
    "\n",
    "  - 负例采样（Negative Sampling）\n",
    "\n",
    "    - 在正确单词以外的负样本中进行采样，最终目的是为了减少负样本的数量，达到减少计算量效果。将词典中的每一个词对应一条线段，所有词组成了[0，1］间的剖分，如下图所示，然后每次随机生成一个[1, M-1]间的整数，看落在哪个词对应的剖分上就选择哪个词，最后会得到一个负样本集合。\n",
    "\n",
    "    - 如果 vocabulary 大小为10000时， 当输入样本 ( \"fox\", \"quick\") 到神经网络时， “ fox” 经过 one-hot 编码，在输出层我们期望对应 “quick” 单词的那个神经元结点输出 1，其余 9999 个都应该输出 0。在这里，这9999个我们期望输出为0的神经元结点所对应的单词我们为 negative word. negative sampling 的想法也很直接 ，将随机选择一小部分的 negative words，比如选 10个 negative words 来更新对应的权重参数。\n",
    "\n",
    "      假设原来模型每次运行都需要300×10,000(其实没有减少数量，但是运行过程中，减少了需要载入的数量。) 现在只要300×(1+10)减少了好多。\n",
    "\n",
    "    - 选择negative samples：常出现的高频词有更大的概率被选为负例。直接基于词频的权重分布获得概率分布进行抽样\n",
    "\n",
    "    ![image-20210118170045700](http://image.wonkers.cn//image-20210118170045700.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4aacf4e0",
   "metadata": {},
   "source": [
    "## Glove\n",
    "\n",
    "**GloVe的全称叫Global Vectors for Word Representation，它是一个基于全局词频统计（count-based & overall statistics）的词表征（word representation）工具，它可以把一个单词表达成一个由实数组成的向量，这些向量捕捉到了单词之间一些语义特性，比如相似性（similarity）、类比性（analogy）等。**我们通过对向量的运算，比如欧几里得距离或者cosine相似度，可以计算出两个单词之间的语义相似性。\n",
    "\n",
    "实现步骤\n",
    "\n",
    "- 构建共现矩阵\n",
    "\n",
    "  根据语料库（corpus）构建一个共现矩阵（Co-ocurrence Matrix）X，**矩阵中的每一个元素 Xij 代表单词 i 和上下文单词 j 在特定大小的上下文窗口（context window）内共同出现的次数。**一般而言，这个次数的最小单位是1，但是GloVe不这么认为：它根据两个单词在上下文窗口的距离 d，提出了一个衰减函数（decreasing weighting）：decay=1/d 用于计算权重，也就是说**距离越远的两个单词所占总计数（total count）的权重越小**。\n",
    "\n",
    "- 构建词向量和共现矩阵之间的近似关系\n",
    "\n",
    "  - $w_{i}^{T}\\tilde{w_{j}} + b_i + \\tilde{b_j} = \\log(X_{ij}) \\tag{1}$\n",
    "  - 其中，$w_{i}^{T}$和$\\tilde{w_{j}}$是我们最终要求解的词向量；$b_i$和$\\tilde b_j$分别是两个词向量的bias term。\n",
    "\n",
    "- 构建损失函数\n",
    "\n",
    "  $J = \\sum_{i,j=1}^{V} f(X_{ij})(w_{i}^{T}\\tilde{w_{j}} + b_i + \\tilde{b_j} – \\log(X_{ij}) )^2 \\tag{2}$\n",
    "\n",
    "![image-20210118190523016](http://image.wonkers.cn//image-20210118190523016.png)\n",
    "\n",
    "- 流程梳理\n",
    "\n",
    "![image-20210118190638818](http://image.wonkers.cn//image-20210118190638818.png)\n",
    "\n",
    "- Glove与LSA、word2vec的比较\n",
    "\n",
    "  LSA（Latent Semantic Analysis）是一种比较早的count-based的词向量表征工具，它也是基于co-occurance matrix的，只不过采用了基于奇异值分解（SVD）的矩阵分解技术对大矩阵进行降维，而我们知道SVD的复杂度是很高的，所以它的计算代价比较大。还有一点是它对所有单词的统计权重都是一致的。而这些缺点在GloVe中被一一克服了。而word2vec最大的缺点则是没有充分利用所有的语料，所以GloVe其实是把两者的优点结合了起来。从这篇论文给出的实验结果来看，GloVe的性能是好于LSA和word2vec的，但也有说GloVe和word2vec实际表现其实差不多。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44a4532d",
   "metadata": {},
   "source": [
    "本节内容侧重理论介绍，后面用例子演示词向量模型的训练。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac342eca",
   "metadata": {},
   "source": [
    "本节完。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "019b994f",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}